{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个教程中大部分时间都是使用pandas.read_csv来读取数据，所以了解pytohn如何处理文件是很重要的。\n",
    "\n",
    "用内建的open函数能打开、读取、写入一个文件，要给open一个相对路径或绝对路径："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "path = '../examples/segismundo.txt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面的`..`表示返回上一个层级"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "f = open(path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for line in f:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "文件中的每一个line都是有end-of-line（EOL），行终结标志的，所以我们经常用下面的方法都到没有行终结标志的list of line："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lines = [x.rstrip() for x in open(path)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Sueña el rico en su riqueza,',\n",
       " 'que más cuidados le ofrece;',\n",
       " '',\n",
       " 'sueña el pobre que padece',\n",
       " 'su miseria y su pobreza;',\n",
       " '',\n",
       " 'sueña el que a medrar empieza,',\n",
       " 'sueña el que afana y pretende,',\n",
       " 'sueña el que agravia y ofende,',\n",
       " '',\n",
       " 'y en el mundo, en conclusión,',\n",
       " 'todos sueñan lo que son,',\n",
       " 'aunque ninguno lo entiende.',\n",
       " '']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当我们open文件后，一定要记得在结束后close文件，这样就不会占用操作系统的资源了："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "另一种方法可以在运行结束后自动关闭文件，用with语句。其实这样的用法更常见，推荐用with："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open(patn) as f:\n",
    "    lines = [x.rstrip() for x in f]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如何我们输入`f = open(path, 'w')`，那么会在`../examples/`下创建一个新的segismundo.txt文件，如果有同名文件的话会被覆盖。w就表示写入的意思。\n",
    "\n",
    "即打开文件的同时就规定好我们是以哪种模式打开的，下面就是一些有效的打开模式：\n",
    "\n",
    "模式 | 描述\n",
    "---- | ---\n",
    "r | 单一读取模式\n",
    "w | 单一写入模式；创建一个新文件（消除同名文件的数据）\n",
    "x | 单一写入模式；创建一个新文件，但如果path已经存在的话会失败\n",
    "a | 添加内容到已经存在的文件里（如果文件不存在的话创建新文件）\n",
    "r+ | 读取和写入双模式\n",
    "b | 针对二进制文件的模式（'rb': 或 'wb'）\n",
    "t | 文本模式（自动编码bytes为Unicode）。如果不明说的话默认就是这种模式。添加t到其他模式后面（'rt' or 'xt'）\n",
    "\n",
    "这里有一张表格会更全一些：\n",
    "\n",
    "![](../MarkdownPhotos/chp03/屏幕快照 2017-10-24 上午9.38.24.png)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "想要写文本到文件里的话，用file的write或writelines方法。比如，我们想给prof_mod.py写一个没有空白行的版本："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open('../examples/tmp.txt', 'w') as handle:\n",
    "    handle.writelines(x for x in open(path) if len(x) > 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Sueña el rico en su riqueza,\\n',\n",
       " 'que más cuidados le ofrece;\\n',\n",
       " 'sueña el pobre que padece\\n',\n",
       " 'su miseria y su pobreza;\\n',\n",
       " 'sueña el que a medrar empieza,\\n',\n",
       " 'sueña el que afana y pretende,\\n',\n",
       " 'sueña el que agravia y ofende,\\n',\n",
       " 'y en el mundo, en conclusión,\\n',\n",
       " 'todos sueñan lo que son,\\n',\n",
       " 'aunque ninguno lo entiende.\\n']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('../examples/tmp.txt') as f:\n",
    "    lines = f.readlines()\n",
    "lines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面的方法，有需要再去看吧：\n",
    "\n",
    "![](../MarkdownPhotos/chp03/屏幕快照 2017-10-24 上午9.48.27.png)\n",
    "\n",
    "\n",
    "![](../MarkdownPhotos/chp03/屏幕快照 2017-10-24 上午9.50.36.png)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bytes and Unicode with Files\n",
    "\n",
    "不论是读取还是写入，默认的python文件都是 text mode（文本模式），意味着你是与python string（i.e., Unicode）打交道。这和binary mode（二进制模式）形成了对比。这里举个栗子（下面的文件包含non-ASCII字符，用UTF-8编码）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open(path) as f:\n",
    "    chars = f.read(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Sueña el r'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chars"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "UTF-8是一种长度可变的Unicode编码，所以我们想要从文件中读取一定数量的字符时，python会读取足够的bytes（可能从10到40）然后解码城我们要求数量的字符。而如果我们用'rb'模式的话，read只会读取相应的bytes数量："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "with open(path, 'rb') as f:\n",
    "    data = f.read(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "b'Sue\\xc3\\xb1a el '"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "取决于文本的编码，你能够把bytes解码为str，不过如果编码的Unicode字符不完成的话，是无法解码的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Sueña el '"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.decode('utf8')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "UnicodeDecodeError",
     "evalue": "'utf-8' codec can't decode byte 0xc3 in position 3: unexpected end of data",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mUnicodeDecodeError\u001b[0m                        Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-21-300e0af10bb7>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'utf8'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpected end of data"
     ]
    }
   ],
   "source": [
    "data[:4].decode('utf8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在使用open的时候，文本模式是有一个编码选项的，这能更方便我们把一种Unicode编码变为另一种："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sink_path = '../examples/sink.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open(path) as source:\n",
    "    with open(sink_path, 'xt', encoding='iso-8859-1') as sink:\n",
    "        sink.write(source.read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sueña el r\n"
     ]
    }
   ],
   "source": [
    "with open(sink_path, encoding='iso-8859-1') as f:\n",
    "    print(f.read(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意：在任何模式下使用seek打开文件都可以，除了二进制模式。如果文件的指针落在bytes（Unicode编码）的中部，那么之后使用read会报错："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "f = open(path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Sueña'"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f.read(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f.seek(4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "UnicodeDecodeError",
     "evalue": "'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mUnicodeDecodeError\u001b[0m                        Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-28-7841103e33f5>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m/Users/xu/anaconda/envs/py35/lib/python3.5/codecs.py\u001b[0m in \u001b[0;36mdecode\u001b[0;34m(self, input, final)\u001b[0m\n\u001b[1;32m    319\u001b[0m         \u001b[0;31m# decode input (taking the buffer into account)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    320\u001b[0m         \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuffer\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 321\u001b[0;31m         \u001b[0;34m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconsumed\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_buffer_decode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merrors\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfinal\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    322\u001b[0m         \u001b[0;31m# keep undecoded input until the next call\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    323\u001b[0m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuffer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mconsumed\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mUnicodeDecodeError\u001b[0m: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte"
     ]
    }
   ],
   "source": [
    "f.read(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果我们在做数据分析时，经常使用non-ASCII文本数据的话，掌握python的Unicode功能是很有用的。\n",
    "\n",
    "关于字符串和编码，这里推荐一篇文章，写的非常详细：[字符串，那些你不知道的事](http://liujiacai.net/blog/2015/11/20/strings/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [py35]",
   "language": "python",
   "name": "Python [py35]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
